Abstract

Efficient planning plays a crucial role in 
model-based reinforcement learning. Tradi- 
tionally, the main planning operation is a 
full backup based on the current estimates of 
the successor states. Consequently, its com- 
putation time is proportional to the num- 
ber of successor states. In this paper, we 
introduce a new planning backup that uses 
only the current value of a single successor 
state and has a computation time indepen- 
dent of the number of successor states. This 
new backup, which we call a small backup, 
opens the door to a new class of model-based 
reinforcement learning methods that exhibit 
much finer control over their planning process 
than traditional methods. We empirically 
demonstrate that this increased flexibility al- 
lows for more efficient planning by showing 
that an implementation of prioritized sweep- 
ing based on small backups achieves a sub- 
stantial performance improvement over clas- 
sical implementations. 



1. Introduction 

In reinforcement learning (RL) (Kaelbling et al., 1996; 
Sutton & Barto, 1998), an agent seeks an optimal con- 
trol policy for a sequential decision problem in an ini- 
tially unknown environment. The environment pro- 
vides feedback on the agent's behavior in the form 
of a reward signal. The agent's goal is to maximize 
the expected return, which is the discounted sum of 
rewards over future timesteps. An important perfor- 
mance measure in RL is the sample efficiency, which 
refers to the number of environment interactions that 
is required to obtain a good policy. 

Many solution strategies improve the policy by iteratively improving
a state-value or action-value function, which provides estimates of
the expected return under a given policy for (environment) states or
state-action pairs, respectively. Different approaches for updating
these value functions exist. In terms of sample efficiency, one of
the most effective approaches is to estimate the environment model
using observed samples and to compute, at each time step, the
(action-)value function that is optimal with respect to the model
estimate using planning techniques. A popular planning technique
used for this is value iteration (VI) (Sutton, 1988; Watkins, 1989),
which performs sweeps of backups through the state or state-action
space, until the (action-)value function has converged.

A drawback of using VI is that it is computation- 
ally very expensive, making it infeasible for many 
practical applications. Fortunately, efficient approx- 
imations can be obtained by limiting the number of 
backups that is performed per timestep. A very ef- 
fective approximation strategy is prioritized sweep- 
ing (Moore & Atkeson, 1993; Peng & Williams, 1993), 
which prioritizes backups that are expected to cause 
large value changes. This paper introduces a new 
backup that enables a dramatic improvement in the 
efficiency of prioritized sweeping. 

The main idea behind this new backup is as follows. Consider that we
are interested in some estimate $A$ that is constructed from a sum
of other estimates $X_i$. The estimate $A$ can be computed using a
full backup:

$$A \leftarrow \sum_i X_i .$$

If the estimates $X_i$ are updated, $A$ can be recomputed by redoing
the above backup. Alternatively, if we know that only $X_j$ received
a significant value change, we might want to update $A$ for only
$X_j$. Let us indicate the old value of $X_j$, used to construct the
current value of $A$, as $x_j$. $A$ can then be updated by
subtracting this old value and adding the new value:

$$A \leftarrow A - x_j + X_j .$$

This kind of backup, which we call a small backup, is
computationally cheaper than the full backup. The trade-off is that,
in general, more memory is required for storing the estimates $x_i$
associated with $A$. In planning, where the $X$ estimates correspond
to state-value estimates and $A$ corresponds to a state or
state-action estimate, this is not a serious restriction, because a
full model is stored already. The additional memory required has the
same order of complexity as the memory required for storage of the
model.
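The difference between the two backups of the introduction can be
illustrated with a minimal Python sketch. The names `full_backup`,
`small_backup` and `x_old` are hypothetical and chosen only for this
illustration; `x_old[j]` plays the role of the stored old value of
component j.

```python
# Illustrative sketch only: A is the sum of the estimates X[i], and
# x_old[i] stores the value of X[i] that is currently reflected in A.

def full_backup(X):
    """Recompute A from all components; cost grows with len(X)."""
    return sum(X)

def small_backup(A, x_old, X, j):
    """Update A for component j only; constant cost."""
    A = A - x_old[j] + X[j]
    x_old[j] = X[j]  # remember the value that A now reflects
    return A

X = [1.0, 2.0, 3.0]
x_old = list(X)
A = full_backup(X)
X[1] = 5.0                        # only one component changes
A = small_backup(A, x_old, X, 1)  # agrees with full_backup(X)
```

The extra list `x_old` is the memory cost mentioned above: one stored
value per component of the sum.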

The core advantage of small backups over full back- 
ups is that they enable finer control over the plan- 
ning process. This allows for more effective update 
strategies, resulting in improved trade-offs between 
computation time and quality of approximation of 
the VI solution (and hence sample efficiency). We 
demonstrate this empirically by showing that a prior- 
itized sweeping implementation based on small back- 
ups yields a substantial performance improvement over 
the two classical implementations (Moore & Atkeson, 
1993; Peng & Williams, 1993). 

In addition, we demonstrate the relevance of small backups in
domains with severe constraints on computation time, by showing that
a method that performs one small backup per time step has the same
computation time complexity as TD(0), the classical method that
performs one sample backup per time step.
Since sample backups introduce sampling variance, 
they require a step-size parameter to be tuned for 
optimal performance. Small backups, on the other 
hand, do not introduce sampling variance, allowing 
for a parameter-free implementation. We empirically 
demonstrate that the performance of a method that 
performs one small backup per time step is similar to 
the optimal performance of TD(0), achieved by care- 
fully tuning the step-size parameter. 

2. Reinforcement Learning Framework 

RL problems are often formalized as Markov decision processes
(MDPs), which can be described as tuples
$\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
consisting of $\mathcal{S}$, the set of all states; $\mathcal{A}$,
the set of all actions; $\mathcal{P}_{sa}^{s'} = \Pr(s' \mid s, a)$,
the transition probability from state $s \in \mathcal{S}$ to state
$s'$ when action $a \in \mathcal{A}$ is taken;
$\mathcal{R}_{sa} = E\{r \mid s, a\}$, the reward function giving
the expected reward $r$ when action $a$ is taken in state $s$; and
$\gamma$, the discount factor controlling the weight of future
rewards versus that of the immediate reward.

Actions are selected at discrete timesteps $t = 0, 1, 2, \ldots$
according to a policy
$\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, which
defines for each action the selection probability conditioned on the
state. In general, the goal of RL is to improve the policy in order
to increase the return $G_t$, which is the discounted cumulative
reward

$$G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} ,$$

where $r_{t+1}$ is the reward received after taking action $a_t$ in
state $s_t$ at timestep $t$.

The prediction task consists of determining the value function
$V^\pi(s)$, which gives the expected return when policy $\pi$ is
followed, starting from state $s$. $V^\pi(s)$ can be found by making
use of the Bellman equations for state values, which state the
following:

$$V^\pi(s) = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_s^{s'} V^\pi(s') , \qquad (1)$$

where $\mathcal{R}_s = \sum_a \pi(s,a) \mathcal{R}_{sa}$ and
$\mathcal{P}_s^{s'} = \sum_a \pi(s,a) \mathcal{P}_{sa}^{s'}$.

Model-based methods use samples to update estimates of the
transition probabilities, $\hat{\mathcal{P}}_s^{s'}$, and reward
function, $\hat{\mathcal{R}}_s$. With these estimates, they can
iteratively improve an estimate $V$ of $V^\pi$, by performing full
backups, derived from Equation (1):

$$V(s) \leftarrow \hat{\mathcal{R}}_s + \gamma \sum_{s'} \hat{\mathcal{P}}_s^{s'} V(s') . \qquad (2)$$
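As a minimal sketch of full backup (2), assuming the model estimates
are stored in plain dictionaries (the names `R_hat` and `P_hat` and
this data layout are assumptions made for the illustration, not part
of the paper):

```python
def full_backup(V, s, R_hat, P_hat, gamma):
    """Backup (2): loops over every successor of s."""
    V[s] = R_hat[s] + gamma * sum(p * V[s2] for s2, p in P_hat[s].items())
    return V[s]

# Hypothetical two-state example.
V = {0: 0.0, 1: 10.0}
R_hat = {0: 1.0}
P_hat = {0: {0: 0.5, 1: 0.5}}    # P_hat[s][s2] = estimated Pr(s2 | s)
new_value = full_backup(V, 0, R_hat, P_hat, gamma=0.9)
```

The loop over `P_hat[s]` is what makes the cost of this backup
proportional to the number of successor states.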

In the control task, methods often aim to find the optimal policy
$\pi^*$, which maximizes the expected return. This policy is the
greedy policy with respect to the optimal action-value function
$Q^*(s,a)$, which gives the expected return when taking action $a$
in state $s$, and following $\pi^*$ thereafter. This function is the
solution to the Bellman optimality equation for action-values:

$$Q^*(s,a) = \mathcal{R}_{sa} + \gamma \sum_{s'} \mathcal{P}_{sa}^{s'} \max_{a'} Q^*(s',a') . \qquad (3)$$

The optimal value function is related to the optimal action-value
function through: $V^*(s) = \max_a Q^*(s,a)$.

Model-based methods can iteratively improve estimates $Q$ of $Q^*$
by performing full backups derived from Equation (3):

$$Q(s,a) \leftarrow \hat{\mathcal{R}}_{sa} + \gamma \sum_{s'} \hat{\mathcal{P}}_{sa}^{s'} \max_{a'} Q(s',a') , \qquad (4)$$

where $\hat{\mathcal{R}}_{sa}$ and $\hat{\mathcal{P}}_{sa}^{s'}$ are
estimates of $\mathcal{R}_{sa}$ and $\mathcal{P}_{sa}^{s'}$,
respectively.

Model-free methods do not maintain a model estimate, but update a
value function directly from samples. A classical example of a
sample backup, based on sample $(s, r, s')$, is the TD(0) backup:

$$V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) , \qquad (5)$$

where $\alpha$ is the step-size parameter.
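For comparison with the model-based backups, the TD(0) backup (5)
can be sketched in a few lines of Python. The function name and the
dictionary representation of V are assumptions made for this
illustration.

```python
def td0_backup(V, s, r, s_next, alpha, gamma):
    """Sample backup (5): move V[s] a fraction alpha towards the TD target."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {0: 0.0, 1: 2.0}
td0_backup(V, 0, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
```

Only the sampled successor is touched, so the cost is constant, but
the result depends on which successor happened to be sampled and on
the choice of `alpha`.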
3. Small Backup

This section introduces the small backup. We start with small
state-value backups for the prediction task. Section 3.3 discusses
small action-value backups for the control task.

3.1. Value Backups

In this section, we introduce a small backup version of the full
backup for prediction (backup 2). In the introduction, we showed
that a small backup requires storage of the component values that
make up the current value of a variable. In the case of a small
value backup, the component values correspond to the values of
successor states. We indicate these values by the function
$U_s : \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{R}$. So,
$U_s(s')$ is the value estimate of $s'$ associated with $s$.

Using $U_s$, $V(s)$ can be updated with only the current value of a
single successor state, $s'$, as demonstrated by the following
theorem. The three backups shown in the theorem together form the
small backup.

Theorem 3.1 If the current relation between $V(s)$ and $U_s$ is
given by

$$V(s) = \hat{\mathcal{R}}_s + \gamma \sum_{s''} \hat{\mathcal{P}}_s^{s''} U_s(s'') , \qquad (6)$$

then, after performing the following backups:

$$tmp \leftarrow V(s') \qquad (7)$$
$$V(s) \leftarrow V(s) + \gamma\, \hat{\mathcal{P}}_s^{s'} \left[ V(s') - U_s(s') \right] \qquad (8)$$
$$U_s(s') \leftarrow tmp , \qquad (9)$$

relation (6) still holds, but $U_s(s')$ is updated to $V(s')$.

Proof Backup (8) subtracts the component in relation (6)
corresponding to $s'$ from $V(s)$ and adds a new component based on
the current value estimate of $s'$:

$$V(s) \leftarrow V(s) - \gamma\, \hat{\mathcal{P}}_s^{s'} U_s(s') + \gamma\, \hat{\mathcal{P}}_s^{s'} V(s') .$$

Hence, relation (6) is maintained, while $U_s(s')$ is updated. Note
that $V(s')$ needs to be stored in a temporary variable, since
backup (8) can alter the value of $V(s')$ if $s' = s$. ∎
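The three backups of Theorem 3.1 can be checked numerically with a
short Python sketch. The dictionary layout (`U[s][s2]`,
`P_hat[s][s2]`) and the helper `relation_rhs` are assumptions made
for this illustration; the example includes a self-transition to
show why the temporary variable is needed.

```python
def small_backup(V, U, P_hat, s, s2, gamma):
    """Backups (7)-(9) for the state-successor pair (s, s2)."""
    tmp = V[s2]                                        # (7)
    V[s] += gamma * P_hat[s][s2] * (V[s2] - U[s][s2])  # (8)
    U[s][s2] = tmp                                     # (9)

def relation_rhs(U, R_hat, P_hat, s, gamma):
    """Right-hand side of relation (6)."""
    return R_hat[s] + gamma * sum(p * U[s][x] for x, p in P_hat[s].items())

gamma = 0.9
R_hat = {0: 1.0}
P_hat = {0: {0: 0.5, 1: 0.5}}   # state 0 has a self-transition
U = {0: {0: 0.0, 1: 0.0}}
V = {0: 1.0, 1: 4.0}            # V[0] satisfies relation (6) initially

small_backup(V, U, P_hat, 0, 1, gamma)   # successor 1
small_backup(V, U, P_hat, 0, 0, gamma)   # successor 0 (here s2 == s)
```

After both calls, `V[0]` still equals `relation_rhs(U, R_hat, P_hat,
0, gamma)`, and each call touched only one successor.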

3.2. Value Correction after Model Update 

Theorem 3.1 relies on relation (6) to hold. If the model gets
updated, this relation no longer holds. In this section, we discuss
how to restore relation (6) in a computation-efficient way for the
commonly used model estimate:

$$\hat{\mathcal{P}}_s^{s'} \leftarrow N_s^{s'} / N_s \qquad (10)$$
$$\hat{\mathcal{R}}_s \leftarrow R_s^{sum} / N_s , \qquad (11)$$

where $N_s$ counts the number of times state $s$ is visited,
$N_s^{s'}$ counts the number of times $s'$ is observed as successor
state of $s$, and $R_s^{sum}$ is the sum of observed rewards for
$s$.

Theorem 3.2 If currently, the following relation holds:

$$V(s) = \hat{\mathcal{R}}_s + \gamma \sum_{s''} \hat{\mathcal{P}}_s^{s''} U_s(s'') ,$$

and a sample $(s, r, s')$ is observed, then, after performing the
backups:

$$N_s \leftarrow N_s + 1 ; \quad N_s^{s'} \leftarrow N_s^{s'} + 1 \qquad (12)$$
$$V(s) \leftarrow \left[ V(s)(N_s - 1) + r + \gamma\, U_s(s') \right] / N_s , \qquad (13)$$

the relation still holds, but with updated values for
$\hat{\mathcal{R}}_s$ and $\hat{\mathcal{P}}_s^{s'}$. ∎

Proof (sketch) Backup (13) updates $V(s)$ by computing a weighted
average of $V(s)$ and $r + \gamma\, U_s(s')$. The value change this
causes is the same as the value change caused by updating the model
and then performing a full backup of $s$ based on $U_s$.
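The claim of Theorem 3.2 can be verified on a small example. In the
Python sketch below, the function names and the dictionary layout
are assumptions made for the illustration. `R_sum` is tracked only
so that the right-hand side of relation (6) can be computed for the
check; backup (13) itself never uses it.

```python
def observe(V, U, N, N_succ, s, r, s2, gamma):
    """Backups (12)-(13): fold the sample (s, r, s2) into counts and V[s]."""
    N[s] += 1
    N_succ[s][s2] = N_succ[s].get(s2, 0) + 1
    V[s] = (V[s] * (N[s] - 1) + r + gamma * U[s][s2]) / N[s]

def value_from_model(U, N, N_succ, R_sum, s, gamma):
    """Right-hand side of relation (6) under the count-based model (10)-(11)."""
    expected_u = sum(n * U[s][x] for x, n in N_succ[s].items()) / N[s]
    return R_sum[s] / N[s] + gamma * expected_u

gamma = 0.5
V, U = {0: 7.0}, {0: {0: 2.0, 1: 4.0}}   # V[0] starts at an arbitrary value
N, N_succ, R_sum = {0: 0}, {0: {}}, {0: 0.0}
for r, s2 in [(1.0, 1), (3.0, 0)]:
    observe(V, U, N, N_succ, 0, r, s2, gamma)
    R_sum[0] += r                        # bookkeeping for the check only
```

The first observation overwrites the arbitrary initial value
(it is multiplied by N_s - 1 = 0), after which the relation holds
for every subsequent sample.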

Algorithm 1 shows pseudo-code for a general class of prediction
methods based on small backups. Surprisingly, while it is a planning
method, $\hat{\mathcal{R}}_s$ is never explicitly computed, saving
time and memory. Note that the computation per time step is fully
independent of the number of successor states. Members of this class
need to specify the number of iterations (line 8) as well as a
strategy for selecting state-successor pairs (line 9).

Algorithm 1 Prediction with Small Backups

 1: initialize $V(s)$ arbitrarily for all $s$
 2: initialize $U_s(s') = V(s')$ for all $s, s'$
 3: initialize $N_s, N_s^{s'}$ to 0 for all $s, s'$
 4: loop {over timesteps}
 5:    observe transition $(s, r, s')$
 6:    $N_s \leftarrow N_s + 1$; $N_s^{s'} \leftarrow N_s^{s'} + 1$
 7:    $V(s) \leftarrow [V(s)(N_s - 1) + r + \gamma\, U_s(s')] / N_s$
 8:    loop {for a number of iterations}
 9:       select a pair $(\bar{s}, \bar{s}')$ with $N_{\bar{s}}^{\bar{s}'} > 0$
10:       $tmp \leftarrow V(\bar{s}')$
11:       $V(\bar{s}) \leftarrow V(\bar{s}) + \gamma\, N_{\bar{s}}^{\bar{s}'} / N_{\bar{s}} \cdot [V(\bar{s}') - U_{\bar{s}}(\bar{s}')]$
12:       $U_{\bar{s}}(\bar{s}') \leftarrow tmp$
13:    end loop
14: end loop
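A minimal Python sketch of one member of this class is given below:
it performs a single small backup per time step, for the pair that
was just observed. The class name, the lazy initialisation through
`defaultdict`, and the uniform initial value `v_init` are
assumptions made for the illustration and are not prescribed by
Algorithm 1.

```python
from collections import defaultdict

class SmallBackupPredictor:
    """Sketch of Algorithm 1 with one small backup per observed transition."""

    def __init__(self, gamma, v_init=0.0):
        self.gamma = gamma
        self.v_init = v_init
        self.V = defaultdict(lambda: v_init)
        self.U = defaultdict(dict)                         # U[s][s2]
        self.N = defaultdict(int)                          # N[s]
        self.N_succ = defaultdict(lambda: defaultdict(int))  # N_succ[s][s2]

    def observe(self, s, r, s2):
        # Lines 5-7: update the counts and correct V[s] for the new model.
        # U[s][s2] defaults to the (uniform) initial value of V[s2].
        u = self.U[s].setdefault(s2, self.v_init)
        self.N[s] += 1
        self.N_succ[s][s2] += 1
        self.V[s] = (self.V[s] * (self.N[s] - 1) + r + self.gamma * u) / self.N[s]

    def small_backup(self, s, s2):
        # Lines 10-12 for a pair with N_succ[s][s2] > 0.
        tmp = self.V[s2]
        p = self.N_succ[s][s2] / self.N[s]
        self.V[s] += self.gamma * p * (tmp - self.U[s][s2])
        self.U[s][s2] = tmp

    def step(self, s, r, s2):
        self.observe(s, r, s2)
        self.small_backup(s, s2)

# Hypothetical usage: a single state with a self-loop and reward 2.
agent = SmallBackupPredictor(gamma=0.5)
for _ in range(50):
    agent.step(1, 2.0, 1)
```

In this toy example the true value is 2 / (1 - 0.5) = 4, and
`agent.V[1]` approaches it without any step-size parameter.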



3.3. Action-value Backups 

Before we can discuss small action-value backups, we have to discuss
a more efficient implementation of the full action-value backup.
Backup (4) has a computation time complexity of
$O(|\mathcal{S}||\mathcal{A}|)$. A more efficient implementation can
be obtained by storing state-values, besides action-values,
according to $V(s) = \max_a Q(s, a)$. Backup (4) can then be
implemented by:

$$Q(s,a) \leftarrow \hat{\mathcal{R}}_{sa} + \gamma \sum_{s'} \hat{\mathcal{P}}_{sa}^{s'} V(s') \qquad (14)$$
$$V(s) \leftarrow \max_{a'} Q(s, a') . \qquad (15)$$

The combined computation time of these backups is
$O(|\mathcal{S}| + |\mathcal{A}|)$, a considerable reduction.
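The split into backups (14) and (15) can be sketched as follows in
Python. The nested-dictionary layout (`Q[s][a]`, `R_hat[s][a]`,
`P_hat[s][a][s2]`) and the function names are assumptions made for
this illustration.

```python
def backup_q(Q, V, R_hat, P_hat, s, a, gamma):
    """Backup (14): uses the cached state values, one pass over successors."""
    Q[s][a] = R_hat[s][a] + gamma * sum(
        p * V[s2] for s2, p in P_hat[s][a].items())

def backup_v(Q, V, s):
    """Backup (15): one pass over the actions of s."""
    V[s] = max(Q[s].values())

# Hypothetical example with two actions in state 0.
gamma = 0.5
V = {0: 0.0, 1: 4.0}
Q = {0: {'left': 0.0, 'right': 0.0}}
R_hat = {0: {'left': 1.0, 'right': 0.0}}
P_hat = {0: {'left': {0: 1.0}, 'right': {1: 1.0}}}

for a in Q[0]:
    backup_q(Q, V, R_hat, P_hat, 0, a, gamma)
backup_v(Q, V, 0)
```

Because `V` caches the maximisation over actions, `backup_q` no
longer needs the inner max over successor actions that appears in
backup (4).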

Backup (14) is similar in form to the prediction backup. Hence, we
can make a small backup version of it similar to the one in the
prediction case. The theorems below are the control versions of the
theorems for the prediction case. They can be proven in a similar
way as the prediction theorems.

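Theorem 3.3 If the current relation between $Q(s,a)$ and $U_{sa}$ is
given by

$$Q(s,a) = \hat{\mathcal{R}}_{sa} + \gamma \sum_{s''} \hat{\mathcal{P}}_{sa}^{s''} U_{sa}(s'') , \qquad (16)$$

then, performing the following backups:

$$Q(s,a) \leftarrow Q(s,a) + \gamma\, \hat{\mathcal{P}}_{sa}^{s'} \left[ V(s') - U_{sa}(s') \right]$$
$$U_{sa}(s') \leftarrow V(s') ,$$

maintains this relation while updating $U_{sa}(s')$ to $V(s')$.

Theorem 3.4 If relation (16) holds and a sample $(s, a, r, s')$ is
observed, then, after performing backups

$$N_{sa} \leftarrow N_{sa} + 1 ; \quad N_{sa}^{s'} \leftarrow N_{sa}^{s'} + 1$$
$$Q(s,a) \leftarrow \left[ Q(s,a)(N_{sa} - 1) + r + \gamma\, U_{sa}(s') \right] / N_{sa} ,$$

relation (16) still holds, but with updated values for
$\hat{\mathcal{R}}_{sa}$ and $\hat{\mathcal{P}}_{sa}^{s'}$.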

A small action-value backup is a finer-grained version of backup
(14): performing a small backup of $Q(s,a)$ for each successor state
is equivalent (in computation time complexity and effect) to
performing backup (14) once. While in principle, backup (15) can be
performed after each small backup, it is not very efficient to do
so, since small backups make many small changes. More efficient
planning can be obtained when backup (15) is performed only once in
a while.

In Section 4, we discuss an implementation of prioritized sweeping
based on small action-value backups.



3.4. Small Backups versus Sample Backups 

A small backup has in common with a sample backup 
that both update a state value based on the current 
value of only one of the successor states. In addition, 
they share the same computation time complexity and 
their effect is in general smaller than that of a full 
backup. 

A disadvantage of a sample backup, with respect to a 
small backup, is that it introduces sampling variance, 
caused by a stochastic environment. This requires the 
use of a step-size parameter to enable averaging over 
successor states (and rewards). A small backup does 
not introduce sampling variance, since it is implicitly 
based on an expectation over successor states. Hence, 
it does not require tuning of a step-size parameter for 
optimal performance. 

A second disadvantage of a sample backup is that it af- 
fects the perceived distribution over action outcomes, 
which places some restrictions on reusing samples. For 
example, a model-free technique like experience replay 
(Lin, 1992), which stores experience samples in order 
to replay them at a later time, can introduce bias, 
which reduces performance, if some samples are re- 
played more often than others. For small backups this 
does not hold, since the process of learning the model 
is independent from the backups based on the model. 
This allows small backups to be combined with effec- 
tive selection strategies like prioritized sweeping. 

4. Prioritized Sweeping with Small Backups

Prioritized sweeping (PS) makes the planning step of 
model-based RL more efficient by using a heuristic (a 
'priority') for selecting backups that favours backups 
that are expected to cause a large value change. A pri- 
ority queue is maintained that determines which values 
are next in line for receiving backups. 

There are two main implementations: one by Moore & Atkeson (1993)
and one by Peng & Williams (1993) [1]. All PS methods have in common
that they perform backups in what we call update cycles. By
adjusting the number of update cycles that is performed per time
step, the computation time per time step can be controlled. Below,
we discuss in detail what occurs in an update cycle for the two
classical PS implementations.

[1] We refer to the version of 'queue-Dyna' for stochastic domains,
which is different from the version for deterministic domains.
4.1. Classical Prioritized Sweeping Implementations

In the Moore & Atkeson implementation the elements in the queue are
states and the backups are full value backups. In control, a full
value backup is different from backup (2). Instead, it is equivalent
(in effect and computation time) to performing backup (14) for each
action, followed by backup (15). Hence, the associated computation
time has complexity
$O(|\mathcal{S}||\mathcal{A}| + |\mathcal{A}|)$.

An update cycle consists of the following steps. First, the top
state is removed from the queue, and receives a full value backup.
Let $s$ be the top state and $\Delta V_s$ the value change caused by
the backup. Then, for all predecessor state-action pairs
$(\bar{s}, \bar{a})$ a priority $p$ is computed, using:

$$p \leftarrow \hat{\mathcal{P}}_{\bar{s}\bar{a}}^{s} \cdot |\Delta V_s| . \qquad (17)$$

If $\bar{s}$ is not yet on the queue, then it is added with priority
$p$. If $\bar{s}$ is on the queue already, but its current priority
is smaller than $p$, then the priority of $\bar{s}$ is upgraded to
$p$.
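The queue handling of such an update cycle can be sketched in
Python as follows. The queue is represented here as a plain
dictionary from state to priority, which keeps the sketch short but
makes `pop_top` linear in the queue size; an actual implementation
would more likely use a heap. All names and the data layout
(`predecessors[s]` as a list of state-action pairs,
`P_hat[s][a][s2]`) are assumptions made for this illustration.

```python
def promote(queue, state, p):
    """Add state with priority p, or raise its priority if p is larger."""
    if state not in queue or p > queue[state]:
        queue[state] = p

def pop_top(queue):
    """Remove and return the state with the highest priority."""
    state = max(queue, key=queue.get)
    return state, queue.pop(state)

def push_predecessors(queue, predecessors, P_hat, s, delta_v):
    """Compute priority (17) for every predecessor pair of s."""
    for (s_bar, a_bar) in predecessors[s]:
        p = P_hat[s_bar][a_bar][s] * abs(delta_v)
        promote(queue, s_bar, p)

# Hypothetical example: state 2 changed its value by -4.
queue = {}
predecessors = {2: [(0, 'a'), (1, 'a')]}
P_hat = {0: {'a': {2: 0.5}}, 1: {'a': {2: 1.0}}}
push_predecessors(queue, predecessors, P_hat, 2, delta_v=-4.0)
promote(queue, 0, 1.0)          # lower than the current priority: no effect
top_state, top_priority = pop_top(queue)
```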

The Peng & Williams implementation differs from the Moore & Atkeson
implementation in that the backup is not a full value backup.
Instead, it is a backup with the same effect as a small action-value
backup, but with a computational complexity of
$O(|\mathcal{S}| + |\mathcal{A}|)$. So, it is a cheaper backup than
a full backup, but its value change is (much) smaller. The backup
requires a state-action-successor triple. Hence, these triples are
the elements on the queue. Predecessors are added to the queue with
priorities that estimate the action-value change.

4.2. Small Backup implementation 

A natural small backup implementation might ap- 
pear to be an implementation similar to that of Peng 
& Williams, but with the main backup implemented 
more efficiently. The low computational cost of a small 
backup, however, allows for a much more powerful im- 
plementation. The pseudo-code of this implementa- 
tion is shown in Algorithm 2. Below, we discuss some 
key characteristics of the algorithm. 

The computation time of a small backup is so low that it is
comparable to the priority computation in the classical PS
implementations. Therefore, instead of computing a priority for each
predecessor and performing a backup for the element with the highest
priority in the next update cycle, we can perform a small backup for
all predecessors. This raises the question of what to put in the
priority queue and what type of backup to perform for the top
element. The natural answer is to put states in the priority queue
and to perform backup (15) for the top state.

The priority associated with a state is based on the change in
action-value that has occurred due to small backups, since the last
value backup. This priority ensures that states with a large
discrepancy between the state value and action-values receive a
value backup first.

One surprising aspect of the algorithm is that it does not use the
function $U_{sa}$, which forms an essential part of small
action-value backups. The reason is that due to the specific backup
strategy used by the algorithm, $U_{sa}(s')$ is equal to $V(s')$ for
all state-action pairs $(s, a)$ and all successor states $s'$.
Hence, instead of using $U_{sa}$, $V$ can be used, saving memory and
simplifying the code.

Table 1 shows the computation time complexity of 
an update cycle for the different PS implementations. 
The small backup implementation is the cheapest one 
among the three. 





                      top-element backups    other
Moore & Atkeson       O(|S||A| + |A|)        O(Pre)
Peng & Williams       O(|S| + |A|)           O(Pre)
small backups         O(|A|)                 O(Pre)

Table 1. Computation time associated with one update cycle for the
different PS implementations. Pre indicates the number of
predecessors, i.e. state-action pairs that transition to the state
whose value has just been updated.



5. Experimental Results 

In this section, we evaluate the performance of a mini- 
mal version of Algorithm 1, as well as the performance 
of Algorithm 2. 

5.1. Small backup versus Sample backup 

We compare the performance of TD(0), which per- 
forms one sample backup per time step, with a version 
of Algorithm 1 that performs one small backup per 
time step. Specifically, its number of iterations (line 
8) is 1, and the selected state-successor pair (line 9) is 
the pair corresponding to the most recent transition. 

Their performance is compared on two evaluation tasks, both
consisting of 10 states, laid out in a circle. State transitions
only occur between neighbours. The transition probabilities for both
tasks are generated by a random process. Specifically, the
transition probability to a neighbour state is generated by a random
number between 0 and 1 and normalized such that the sum of the
transition probabilities to the left and right neighbour is 1. The
reward for counter-clockwise transitions is always +1. The reward
for clockwise transitions is different for the two tasks. In the
first task, a clockwise transition results in a reward of -1; in the
second task, it results in a reward of +1. The discount factor
$\gamma$ is 0.95 and the initial state values are 0.

Algorithm 2 Prioritized Sweeping with Small Backups

initialize $V(s)$ arbitrarily for all $s$
initialize $Q(s,a) = Q_{prev}(s,a) = V(s)$ for all $s, a$
initialize $N_{sa}, N_{sa}^{s'}$ to 0 for all $s, a, s'$
loop {over episodes}
   initialize $s$
   repeat {for each step in the episode}
      select action $a$, based on $Q(s, \cdot)$
      take action $a$, observe $r$ and $s'$
      $N_{sa} \leftarrow N_{sa} + 1$; $N_{sa}^{s'} \leftarrow N_{sa}^{s'} + 1$
      $Q(s,a) \leftarrow [Q(s,a)(N_{sa} - 1) + r + \gamma V(s')] / N_{sa}$
      $p \leftarrow |Q(s,a) - Q_{prev}(s,a)|$
      if $s$ not on queue or $p >$ current priority of $s$, then promote $s$ to $p$
      for a number of update cycles do
         remove top state $\bar{s}'$ from queue
         for all $b$: $Q_{prev}(\bar{s}', b) \leftarrow Q(\bar{s}', b)$
         $tmp \leftarrow V(\bar{s}')$
         $V(\bar{s}') \leftarrow \max_b Q(\bar{s}', b)$
         $\Delta V \leftarrow V(\bar{s}') - tmp$
         for all $(\bar{s}, \bar{a})$ pairs with $N_{\bar{s}\bar{a}}^{\bar{s}'} > 0$ do
            $Q(\bar{s}, \bar{a}) \leftarrow Q(\bar{s}, \bar{a}) + \gamma\, N_{\bar{s}\bar{a}}^{\bar{s}'} / N_{\bar{s}\bar{a}} \cdot \Delta V$
            $p \leftarrow |Q(\bar{s}, \bar{a}) - Q_{prev}(\bar{s}, \bar{a})|$
            if $\bar{s}$ not on queue or $p >$ current priority of $\bar{s}$, then promote $\bar{s}$ to $p$
         end for
      end for
      $s \leftarrow s'$
   until $s$ is terminal
end loop
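The core of Algorithm 2 (the model update on each observation and
one update cycle) can be sketched in Python as below. This is a
sketch under stated assumptions rather than the authors'
implementation: the class and method names are hypothetical, the
queue is a plain dictionary instead of a heap, all values start at a
uniform `v_init`, and action selection and the episode loop are left
to the caller.

```python
from collections import defaultdict

class SmallBackupPS:
    """Sketch of the observation step and update cycle of Algorithm 2."""

    def __init__(self, actions, gamma, v_init=0.0):
        self.actions = list(actions)
        self.gamma = gamma
        self.V = defaultdict(lambda: v_init)
        self.Q = defaultdict(lambda: {a: v_init for a in self.actions})
        self.Q_prev = defaultdict(lambda: {a: v_init for a in self.actions})
        self.N = defaultdict(int)                            # N[(s, a)]
        self.N_succ = defaultdict(lambda: defaultdict(int))  # N_succ[(s, a)][s2]
        self.pred = defaultdict(set)                         # pred[s2] = {(s, a)}
        self.queue = {}                                      # state -> priority

    def _promote(self, s, p):
        if s not in self.queue or p > self.queue[s]:
            self.queue[s] = p

    def observe(self, s, a, r, s2):
        # Update the counts, correct Q(s, a) using V in place of U_sa,
        # and put s on the queue.
        sa = (s, a)
        self.N[sa] += 1
        self.N_succ[sa][s2] += 1
        self.pred[s2].add(sa)
        q = self.Q[s]
        q[a] = (q[a] * (self.N[sa] - 1) + r + self.gamma * self.V[s2]) / self.N[sa]
        self._promote(s, abs(q[a] - self.Q_prev[s][a]))

    def update_cycle(self):
        # Value backup (15) for the top state, then a small backup
        # for every predecessor state-action pair.
        if not self.queue:
            return
        top = max(self.queue, key=self.queue.get)
        del self.queue[top]
        self.Q_prev[top] = dict(self.Q[top])
        old = self.V[top]
        self.V[top] = max(self.Q[top].values())
        delta = self.V[top] - old
        for (s, a) in self.pred[top]:
            sa = (s, a)
            self.Q[s][a] += self.gamma * self.N_succ[sa][top] / self.N[sa] * delta
            self._promote(s, abs(self.Q[s][a] - self.Q_prev[s][a]))

# Hypothetical usage on a deterministic chain 0 -> 1 -> 'T' (terminal).
agent = SmallBackupPS(actions=['a'], gamma=0.5)
agent.observe(0, 'a', 0.0, 1)
agent.update_cycle()
agent.observe(1, 'a', 1.0, 'T')
agent.update_cycle()   # backs up state 1 and pushes the change to state 0
agent.update_cycle()   # backs up state 0
```

After the last two update cycles the reward observed at state 1 has
been propagated back to state 0, which is the backward propagation
that prioritized sweeping is designed to achieve.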

For TD(0), we performed experiments with a constant step-size for
values between 0 and 1 with steps of 0.02. In addition, we performed
experiments with a decaying, state-dependent step-size, according to

$$\alpha(s) = \frac{1}{d \cdot (N_s - 1) + 1} , \qquad (18)$$

where $N_s$ is the number of times state $s$ has been visited, and
$d$ specifies the decay rate. We used values of $d$ between 0 and 1
with steps of 0.02. Note that for $d = 0$, $\alpha(s) = 1$, and for
$d = 1$, $\alpha(s) = 1/N_s$.
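The decaying step-size schedule (18) and its two limiting cases can
be written as a one-line Python function; the function name is an
assumption made for this illustration.

```python
def step_size(n_visits, d):
    """Step-size (18) for a state visited n_visits times, decay rate d."""
    return 1.0 / (d * (n_visits - 1) + 1.0)

no_decay = step_size(5, d=0.0)     # constant step-size of 1
full_decay = step_size(5, d=1.0)   # sample average, 1 / N_s
```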



Each time a transition is observed and the corresponding backup is
performed, the root-mean-squared (RMS) error over all states is
determined. The average RMS error over the first 10,000 transitions,
normalized with the initial error, determines the performance.
Figure 1 shows this performance, averaged over 100 runs. The
standard error is negligible: the maximum standard error in the
first task was 0.0057 (after normalization) and in the second task
0.0007. Note that the performance for $d = 0$ is equal to the
performance for $\alpha = 1$, as it should be, by definition. The
normalized performance for $\alpha = 0$ is 1, since no learning
occurs in this case.

These experiments demonstrate three things. First, the optimal
step-size can vary a lot between different tasks. Second, selecting
a sub-optimal step-size can cause large performance drops. Third, a
small backup, which is parameter-free, has a performance similar to
the performance of TD(0) with optimized step-size. Since the
computational complexity is the same, the small backup is a very
interesting alternative to the sample backup in domains with tight
constraints on the computation time, where previously only sample
backups were viable. Keep in mind that a small backup does require a
model estimate, so if there are also tight constraints on the
memory, a sample backup might still be the only option.

Figure 1. Average RMS error over the first 10,000 observations,
normalized by the initial error, for different values of the
step-size parameter $\alpha$, in case of constant step-size, or
different values of the decay parameter $d$, in case of decaying
step-size. The top graph corresponds with the first evaluation task;
the bottom graph with the second.

5.2. Prioritized Sweeping

We compare the performance of prioritized sweeping with small
backups (Algorithm 2) with the two classical implementations of
Moore & Atkeson and Peng & Williams on the maze task depicted in the
top of Figure 2. The reward received at each time step is -1 and the
discount factor is 0.99. The agent can take four actions,
corresponding to the four compass directions, which stochastically
move the agent to a different square. The bottom of Figure 2 shows
the relative action outcomes of a 'north' action. In free space, an
action can result in 15 possible successor states, each with equal
probability. When the agent is close to a wall, this number
decreases.

To obtain an upper bound on the performance, we also compared
against a method that performs value iteration (until convergence)
at each time step, using the most recent model estimate.

As exploration strategy, the agent selects with 5% probability a
random action, instead of the greedy one. On top of that, we use the
'optimism in the face of uncertainty' principle, as also used by
Moore & Atkeson. This means that as long as a state-action pair has
not been visited at least $M$ times, its value is defined as some
optimistically defined value (0 for our maze task), instead of the
value based on the model estimate. We optimized $M$ for the value
iteration method, resulting in $M = 4$, and used this value for all
methods.

We performed experiments for 1, 3, 5 and 10 update cycles per time
step. Figure 3 shows the average return over the first 200 episodes
for the different methods. The results are averaged over 100 runs.
The maximum standard deviation is 0.1 for all methods, except for
the method of Peng & Williams, which had a maximum standard
deviation of 1.0.

The computation time per update cycle was about the same for the
three different PS implementations, with a small advantage for the
small backup implementation, which shows that the O(Pre) computation
(see Table 1) is dominant in this task. The computation time per
observation of the value iteration method was more than 400 times as
high as a single update cycle.

PS with small backups turns out to be very effective. With only a
single update cycle, the value-iteration result can be closely
approximated, in contrast to the two classical implementations. The
results also show that the Peng & Williams method performs
considerably worse than the one of Moore & Atkeson in the considered
domain. This can be explained by the different backups they perform.
The effect of the backup of Peng & Williams is proportional to the
transition probability, which in most cases is $\frac{1}{15}$. In
contrast, the Moore & Atkeson method performs a full backup each
update cycle. While the small backup implementation also uses
backups that are proportional to the transition probability, it
performs a lot more backups per update cycle: a number that is
proportional to the number of predecessors. In general, this number
will increase when the stochasticity of the domain increases.

Figure 2. Above, the maze task, in which the agent must travel from
S to G. Below, transition probabilities ($\frac{1}{15}$) of a
'north' action for different positions of the agent (indicated by
the circle) with respect to the walls (black squares).



Figure 3. Performance of the PS implementations on the maze task for
a different number of update cycles per time step and a method that
performs value iteration at each time step.

6. Discussion

Prioritized sweeping can be viewed as a generalization of the idea
of replaying of experience in backward order (Lin, 1992), which by
itself is related to eligibility traces (Sutton, 1988; Watkins,
1989; Sutton & Singh, 1994). What all these techniques have in
common is that new information (which can be value changes, but at
its core all value changes originate from new data) is propagated
backwards. Whereas backward replay and eligibility traces use the
recent trajectory for backward propagation of information,
prioritized sweeping uses a model estimate for this. Hence, it
propagates new information more broadly.

What gives the performance edge to the small backup implementation
is that it implements the principle of backward updating in a
cleaner and more efficient way. One update cycle of Algorithm 2
represents, in a way, the ultimate backwards backup: all
predecessors are updated with the current value of a chosen state,
which is selected because it recently experienced a large value
change. In contrast, the other PS implementations place the
predecessors in a queue and back up only the state with the highest
priority in the next update cycle. On top of that, the computation
time per update cycle is lower for the small backup implementation
(see Table 1).

The new implementation of PS introduced in this paper would be
impossible without the new backup. The small backup allows for very
targeted updates that are computationally very cheap. This enables
finer control over how computation time is spent, which is what
drives the new PS implementation.

7. Conclusion 

We demonstrated in this paper that the planning step in model-based
reinforcement learning methods can be made substantially more
efficient by making use of small backups. These backups are
finer-grained versions of a full backup, which allow for more
control over how the available computation time is spent. This makes
new, more efficient, update strategies possible. In addition, small
backups can be useful in domains with very tight time constraints,
offering a parameter-free alternative to sample backups, which were
up to now often the only feasible option for such domains.